Abstract

The  Air  Force  Combat  Climatology  Center  (AFCCC)  is  tasked  to  provide  long-range  seasonal  forecasts 
for  worldwide  locations.  Currently,  the  best  long-range  temperature  forecasts  the  weather  community  has 
are  the  climatological  standard  normals.  This  study  creates  a  stepping-stone  into  the  solution  of  long-range 
forecasting  by  finding  a  process  to  predict  temperatures  better  than  those  using  climatological  standard 
normals  or  simple  frequency  distributions  of  occurrences.  Northern  Hemispheric  teleconnection  indices 
and  the  standardized  Southern  Oscillation  index  are  statistically  compared  to  three-month  summed 
Heating  Degree  Days  (HDDs)  and  Cooling  Degree  Days  (CDDs)  at  14  U.S.  locations.  First,  linear 
regression was accomplished. The results showed numerous valid models; however, the percent of
variance resolved by the models was rarely over 30%. The HDDs and CDDs were then analyzed with
data-mining classification tree statistics; however, it proved difficult to extract any predictive
quantitative information from the results. Finally, a data-mining regression tree analysis was performed. At each
conditional outcome, a range of HDDs/CDDs is produced using the predicted standard deviations about
the mean. For verification, independent teleconnection indices were used as predictors in the conditional
model; 90% of the resulting HDDs/CDDs fell into the calculated range. The overall average reduction in
the forecast range was 35.7% over climatology.
Abstract

The  Air  Force  Combat  Climatology  Center  (AFCCC)  is  continually  tasked  to 
provide  temperature  and  other  long-range  seasonal  forecasts  for  locations  at  which 
Department  of  Defense  (DoD)  personnel  are  performing  long-range  exercises  and  real- 
world  mission  planning  support.  DoD  needs  long-range  forecasts  to  estimate  how  much 
fuel  is  necessary  to  keep  energy  production,  purchases  and  operations  at  the  proper  levels 
to  accommodate  all  the  energy  needs  on  their  installations  and  within  their  worldwide 
theaters  of  operation.  Currently,  the  best  long-range  temperature  forecasts  the  weather 
community has for worldwide locations use either climatological standard normals or
simple  frequency  distributions  of  occurrences.  This  study  creates  a  stepping-stone 
toward  the  solution  of  long-range  temperature  forecasting  by  finding  a  process  to  predict 
more  accurate  temperatures  than  those  forecasts  obtained  using  climatological  standard 
normals or simple frequency distributions of occurrences. This same solution is also
highly sought after by many non-DoD users.

Northern Hemispheric teleconnection indices, created by rotated principal
component  analysis  (RPCA),  and  the  standardized  Southern  Oscillation  index  are 
statistically  compared  to  Heating  Degree  Days  (HDDs)  and  Cooling  Degree  Days 
(CDDs)  at  14  U.S.  locations.  HDDs  and  CDDs  were  summed  over  three-month  periods 
to  compute  seasonal  summations.  Teleconnection  indices  found  to  be  leading  modes, 
using  RPCA,  in  a  particular  month  are  compared  to  the  HDD/CDD  summations  of  the 
following  three  months  in  order  to  create  predictive  models. 

First, linear regression is accomplished on the data. The results show numerous
valid modes; however, the percent of HDD and CDD variance resolved by the modes is
rarely over 30%. The HDDs and CDDs are then categorized and analyzed with a
classification tree data-mining program; however, the results did not show any predictive
quantitative information.

A  regression  tree  data  mining  analysis  is  then  performed  on  the  uncategorized 
HDDs/CDDs,  which  shows  excellent  conditional  predictive  outcomes.  At  each 
conditional  outcome,  a  range  of  HDDs/CDDs  is  produced  using  the  predicted  standard 
deviations  about  the  mean.  When  teleconnection  indices  were  used  as  predictors  in  the 
conditional  model,  90%  of  the  time  the  resulting  HDDs/CDDs  fell  into  the  calculated 
range.  Expected  forecast  range  reductions  over  climatology  are  then  calculated,  and  an 
overall  average  expected  forecast  range  reduction  of  35.7%  over  climatology  was 
achieved. 
EXPLORATION OF TELECONNECTION INDICES FOR LONG-RANGE
SEASONAL TEMPERATURE FORECASTS


I.  Introduction 


Background 

The  Air  Force  Combat  Climatology  Center  (AFCCC)  is  continually  tasked  to 
provide  temperature  forecasts  for  locations  at  which  Department  of  Defense  (DoD) 
personnel  are  performing  long-range  exercises  and  real-world  mission  planning  support. 
The  importance  of  these  forecasts  comes  down  to  the  cost  of  moving  equipment  and 
supplies,  aircraft  fuel  loads,  humanitarian  assistance  packages,  and  other  operational 
needs.  Commanders  require  accurate  temperature  forecasts  in  order  to  plan  equipment 
resources  necessary  to  keep  troops  safe  from  the  environmental  elements. 

Any  necessary  equipment  or  clothing  can  drastically  change  the  logistical 
requirements  of  any  mission,  which  is  measured  in  costs  and  expediency.  For  example,  a 
mission  anywhere  where  the  temperature  falls  below  freezing  requires  extra  clothing, 
heating  equipment,  heated  facilities,  additional  aircraft  maintenance  equipment,  deicing 
equipment,  etc.  A  large  mission  with  these  requirements  can  add  millions  of  dollars  to 
the  cost  of  the  deployment.  The  Gulf  War,  for  example,  cost  61  billion  dollars  (Horan, 
1997).  Troops  were  required  to  take  both  hot  and  cold  weather  clothing  items  for  the 
variety of weather conditions experienced in the region (USAF, 1991). If it had been possible
to give commanders better long-range temperature forecasts, they might have been able
to avoid taking all or part of the cold-weather clothing, saving millions of dollars and
vitally needed airlift capacity in the process. Accurate forecasts can also help
smaller-scale military teams. For example, special operation forces deployed in a country
such  as  Afghanistan  cannot  afford  to  carry  unnecessary  equipment.  Both  cold  mountain 
areas  and  hot  desert  areas  dominate  the  Afghanistan  terrain.  Accurate  long-range 
forecasts  are  vital  to  their  mission  success  as  well  as  the  success  of  the  massive  airlift 
operations  required  to  support  the  war  effort. 

Long-range  temperature  forecasting  is  not  only  important  in  mission  planning,  but 
also  for  planning  fuel  costs  for  energy  consumption.  The  DoD,  just  like  the  general 
population,  needs  to  forecast  how  much  fuel  is  necessary  to  keep  energy  production  and 
purchases  at  the  proper  levels  to  accommodate  all  the  energy  needs  on  their  installations 
and  in  their  worldwide  theaters  of  operation.  This  can  become  very  difficult,  especially  if 
there  are  significant  temperature  anomalies,  such  as  periods  of  extreme  hot  or  cold 
conditions.  When  there  are  significant  temperature  anomalies,  there  is  usually  not 
enough  fuel  to  maintain  the  amount  of  energy  being  consumed.  The  better  the  long-range 
temperature  forecasts  are,  the  better  the  initial  estimates  of  needed  fuel  reserves  for 
energy  use.  In  addition,  it  is  hoped  the  improvement  of  long-range  temperature  forecasts 
may  lead  to  improved  long-range  forecasts  of  other  climatic  elements. 

Long-range  weather  forecasting 

Lorenz  saw  his  initial  weather  patterns  grow  farther  and  farther  apart  in  model 
simulations  until  all  resemblance  to  each  other  had  disappeared.  He  decided  that  long- 
range  weather  forecasting  must  be  doomed  (Gleick,  1987).  Today,  it  is  thought  that 
numerical  models  are  not  valid  after  the  15-day  point  (Anthes,  1986).  Clearly  the 
immediate future of long-range weather forecasting does not lie with the use of
short-range numerical weather prediction models.

Baur  (1951)  suggested  long-range  weather  forecasting  could  be  possible  using 
large-scale  spatial  circulation  patterns,  which  he  termed  Grosswetterlagen.  Since  then, 
countless  studies  compared  large-scale  weather  patterns  with  weather  parameters  around 
the  world.  Most,  however,  do  not  try  to  use  the  patterns  as  forecast  tools.  This  research 
attempts  to  look  at  forecasting  long-range  temperatures  by  using  techniques  similar  to  the 
Grosswetterlagen  method,  using  global  teleconnection  patterns  (Wallace  and  Gutzler, 
1981).  This  research  attempts  to  take  the  concept  Baur  had  and  use  today’s  technology  to 
make  forecasts  once  thought  impossible. 

Currently,  the  best  long-range  temperature  forecasts  the  weather  community  has  for 
worldwide  locations  are  the  climatological  standard  normals,  which  are  averages  of 
climatological  data  calculated  for  the  following  consecutive  30-year  periods,  established 
by  international  agreement:  1  January  1901  to  31  December  1930;  1  January  1931  to  31 
December 1960; 1 January 1961 to 31 December 1990; etc. (Glickman, 2001). The U.S.
Climate Prediction Center (CPC) calculates standard normals for U.S. stations at the end
of  each  decade  (CPC,  2001).  However,  temperature  anomalies,  which  are  the  most 
important  features  in  long-range  mission  and  energy  planning,  are  smoothed  out  or 
unseen  over  such  30-year  averages. 
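
As a minimal sketch of why a 30-year average hides anomalies, the following Python snippet computes a normal from a hypothetical series of yearly mean temperatures. The function name, the synthetic values, and the use of annual means are illustrative assumptions and do not represent CPC's procedure.

```python
from statistics import mean

def standard_normal(yearly_values, start_year, end_year):
    """Average the values for start_year..end_year inclusive (one normal period)."""
    window = [v for y, v in yearly_values.items() if start_year <= y <= end_year]
    if len(window) != end_year - start_year + 1:
        raise ValueError("missing years in the averaging period")
    return mean(window)

# Hypothetical yearly mean temperatures (deg F): 55 in every year
# except one anomalously cold year.
temps = {year: 55.0 for year in range(1961, 1991)}
temps[1977] = 49.0

normal = standard_normal(temps, 1961, 1990)
anomaly_1977 = temps[1977] - normal  # departure of the cold year from the normal
```

In this synthetic case the cold year lies about 6 degrees below the other years, yet it shifts the 30-year normal by only 0.2 degrees, so a forecast based on the normal alone says nothing about such a year.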

This  research  focuses  on  finding  a  process  to  predict  more  accurate  temperatures 
than  those  obtained  by  using  climatological  standard  normals  or  simple  frequency 
distributions  of  occurrences.  This  study  investigates  the  relationships  between 
temperature  and  known  global  teleconnection  patterns.  Finding  a  significant  relationship 
that affects DoD missions and energy consumption might save DoD billions of
dollars per year.

Scope  of  Research 

Any  forecasting  tool  needs  to  be  reproducible  and  readily  available  for  users  in 
the field, without a great deal of trouble gaining necessary data. For this reason, the
Standardized Northern Hemisphere Teleconnection Indices and the Southern Oscillation
Index produced by the Climate Prediction Center (CPC) of the National Centers for
Environmental Prediction (NCEP) are used in this research. CPC’s indices are produced
monthly and are available to users on the CPC web site:
http://www.cpc.ncep.noaa.gov. This research investigates statistical methods of
using  these  monthly  indices  to  predict  U.S.  seasonal  temperatures  from  one  to  three 
months  in  advance. 

One  way  to  represent  temperature  forecasts  over  a  period  of  time  is  taken  from  the 
civil  engineering  community.  Their  primary  need  is  a  means  to  relate  temperatures  to  the 
demand  for  fuel  consumption  over  a  specific  period  of  time,  and  they  utilize  Heating 
Degree-Days  (HDDs)  or  Cooling  Degree-Days  (CDDs)  in  this  effort. 
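
A minimal sketch of the degree-day calculation follows. It assumes the common convention of a 65 deg F base temperature and a daily mean taken as the average of the daily maximum and minimum; neither value is stated in the text above, and the function names are hypothetical.

```python
BASE_TEMP_F = 65.0  # assumed base temperature; not specified in the text

def degree_days(t_max, t_min, base=BASE_TEMP_F):
    """Return (HDD, CDD) for one day from its maximum and minimum temperature."""
    t_mean = (t_max + t_min) / 2.0
    hdd = max(0.0, base - t_mean)  # heating demand when the day is colder than base
    cdd = max(0.0, t_mean - base)  # cooling demand when the day is warmer than base
    return hdd, cdd

def summed_degree_days(daily_max_min, base=BASE_TEMP_F):
    """Sum HDDs and CDDs over a period given (t_max, t_min) pairs."""
    total_hdd = total_cdd = 0.0
    for t_max, t_min in daily_max_min:
        hdd, cdd = degree_days(t_max, t_min, base)
        total_hdd += hdd
        total_cdd += cdd
    return total_hdd, total_cdd

# Three hypothetical days: cold, mild, hot.
days = [(40.0, 20.0), (70.0, 60.0), (95.0, 75.0)]
hdd_total, cdd_total = summed_degree_days(days)
```

Under these assumptions a day contributes to at most one of the two totals, which is why HDDs and CDDs can be summed over a season as separate measures of heating and cooling demand.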

This  research  uses  various  statistical  software  packages  to  explore  any  relationships 
between  teleconnection  indices  and  HDDs/CDDs  for  14  locations  that  have  current 
temperature  data  available.  To  ensure  the  utmost  quality  of  the  temperature  data  used, 
only  U.S.  first-order  stations  are  used  in  this  analysis.  All  of  the  cities  have  different 
periods  of  record  for  their  temperature  data,  and  the  teleconnection  data  is  only  from 
1950  to  present. 
Research Objectives


The  goal  of  this  research  is  to  use  known  significant  teleconnection  indices  to 
create  a  predictive  tool  for  forecasting  long-range  temperature  patterns  over  the  U.S.  The 
specific  objectives  necessary  to  achieve  this  goal  are: 

1.  to  gather  temperature  data  from  14  locations  across  the  U.S.  in  order  to 
represent  most  climatic  regimes  across  the  country; 

2.  to calculate and compile monthly HDD and CDD values from this data;

3.  to  gather  teleconnection  indices  from  the  14  most  significantly  known 
Northern  Hemisphere  teleconnections  and  the  Southern  Oscillation  Index  in 
the  Southern  Hemisphere; 

4.  to  remove  ten  years  of  the  data  for  later  verification  of  any  relationship 
identified; 

5.  to  analyze  data  with  a  thorough  regression  analysis  to  find  any  significant 
relationships  between  monthly  teleconnection  indices  and  the  summation  of 
HDDs/CDDs  for  the  following  three  months; 

6.  to  use,  if  necessary,  data  mining  techniques  to  find  any  predictive 
relationships  if  standard  statistical  methods  fail; 

7.  to create predictive tools using monthly teleconnection indices as the
predictor and seasonally summed HDDs and CDDs as the predictand for any
relationships found;

8.  to verify any predictive models developed by using ten years of
independent data not included in creating the predictive models; and

9.  to  investigate  the  spatial  homogeneity  of  the  created  prediction  trees. 
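
Objectives 4 and 5 can be illustrated with a short sketch that pairs each month's index value with the degree-day sum of the following three months and then sets aside chosen years for verification. The data layout, the function names, and the rule of assigning a pair to the year of its predictor month are assumptions made for illustration; the objectives above do not specify how the ten verification years are selected.

```python
def seasonal_pairs(index_by_month, dd_by_month):
    """Pair each month's index with the degree-day sum of the next three months.

    Both arguments map (year, month) -> value. Months lacking a complete
    three-month follow-on period are skipped.
    """
    pairs = {}
    for (year, month), index_value in index_by_month.items():
        following = []
        y, m = year, month
        for _ in range(3):
            m += 1
            if m > 12:  # roll over into the next calendar year
                y, m = y + 1, 1
            following.append((y, m))
        if all(key in dd_by_month for key in following):
            season_sum = sum(dd_by_month[key] for key in following)
            pairs[(year, month)] = (index_value, season_sum)
    return pairs

def split_holdout(pairs, holdout_years):
    """Separate pairs into training and verification sets by predictor year."""
    train = {k: v for k, v in pairs.items() if k[0] not in holdout_years}
    verify = {k: v for k, v in pairs.items() if k[0] in holdout_years}
    return train, verify

# Hypothetical index values and monthly HDD totals.
index = {(1999, 11): 1.2, (1999, 12): -0.4, (2000, 1): 0.3, (2000, 2): 0.9}
hdd = {(1999, 12): 900.0, (2000, 1): 1100.0, (2000, 2): 950.0,
       (2000, 3): 700.0, (2000, 4): 400.0}
pairs = seasonal_pairs(index, hdd)
train, verify = split_holdout(pairs, holdout_years={1999})
```

Keeping the verification years out of the training set before any model is fitted is what makes the later verification independent.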
II. Literature Review


Rotated Principal Component Analysis (RPCA)

The  method  used  for  defining  the  low-frequency  teleconnection  patterns  in  this 
study  is  that  of  Rotated  Principal  Component  Analysis  (RPCA).  RPCA  is  considered  to 
be  superior  to  using  distinct  centers  of  geopotential  height  anomalies  at  select  locations, 
in  that  the  teleconnection  patterns  identified  are  based  on  the  entire  flow  field,  and  not 
just  from  height  anomalies  at  the  selected  locations  (Rodionov  and  Assel,  2000). 

RPCA  uses  the  eigenvectors  of  the  cross-correlation  (or  cross-covariance)  matrix 
from  the  time  variations  of  the  grid-point  values  of  the  700-mb  height  anomalies,  and 
ranks  the  eigenvectors  according  to  the  amount  of  total  variance  they  explain  (creating  a 
PCA).  The  PCA  is  then  orthogonally  rotated  to  get  the  variances  as  close  to  zero  as 
possible (Barnston and Livezey, 1987). Barnston and Livezey (1987) used the RPCA
technique  to  calculate  the  10  most  prominent  teleconnection  patterns  in  each  month.  This 
procedure  isolates  the  primary  teleconnection  patterns  for  all  months  and  allows  for  a 
time  series  of  the  amplitudes  of  the  patterns  to  be  constructed. 
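
The ranking and rotation steps described above can be written compactly. The notation below is introduced here for illustration and is not taken from Barnston and Livezey (1987): Z is assumed to hold anomalies standardized with the sample standard deviation, m is the number of retained modes, and the specific rotation criterion is left unspecified.

```latex
% Z: n x p matrix of standardized 700-mb height anomalies
%    (n monthly fields, p grid points); notation assumed for this sketch.
\begin{align}
  R &= \frac{1}{n-1} Z^{\mathsf{T}} Z
      && \text{(cross-correlation matrix)} \\
  R\,e_k &= \lambda_k e_k, \quad
      \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0
      && \text{(eigenvectors ranked by eigenvalue)} \\
  f_k &= \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j} = \frac{\lambda_k}{p}
      && \text{(fraction of total variance explained by mode } k\text{)} \\
  A &= \begin{bmatrix} \sqrt{\lambda_1}\,e_1 & \cdots & \sqrt{\lambda_m}\,e_m \end{bmatrix}
      && \text{(loadings of the } m \text{ retained modes)} \\
  B &= A\,T, \quad T^{\mathsf{T}} T = I
      && \text{(orthogonal rotation of the retained loadings)}
\end{align}
```

Because T is orthogonal, B B^T = A A^T, so the rotation redistributes variance among the m retained modes without changing the total variance they explain.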

CPC  uses  the  Barnston  and  Livezey  method  by  applying  the  RPCA  technique  to 
monthly  mean  700-mb  height  anomalies  between  January  1964  and  July  1994.  In  CPC’s 
analysis,  ten  patterns  are  determined  for  each  calendar  month  by  using  all  of  the  height 
anomaly  fields  for  the  three-month  period  centered  on  that  month.  For  example,  the  July 
patterns  are  calculated  based  on  the  June  through  August  anomaly  fields  (CPC,  2001). 
Using  RPCA  instead  of  PCA  creates  solutions  that  have  a  physical  meteorological 
interpretability.  The  RPCA  solutions  also  involve  much  smaller  areas  of  the  hemisphere 
(Barnston and Livezey, 1987). A more comprehensive discussion of rotated principal
component solutions is found in Horel (1981) and Barnston and Livezey (1987).

Northern Hemispheric Teleconnection patterns

Teleconnection patterns are macro-β scale patterns resembling standing waves
with  geographically  fixed  centers  (Horel,  1981).  They  are  also  referred  to  as  preferred 
modes  of  low-frequency  variability  (CPC,  2001),  and  several  teleconnection  patterns  in 
planetary  circulation  have  been  documented  by  Barnston  and  Livezey  (1987).  A 
comprehensive  re-analysis  of  Northern  Hemispheric  variability  patterns  has  been 
undertaken by CPC using newly available 700-hPa height data (Washington et al., 2000)
to achieve a better understanding of the synoptic weather patterns related to the
teleconnection  patterns. 

The  13  prominent  Northern  Hemispheric  teleconnection  patterns  used  in  this 
study are separated into three regions: patterns over the North Atlantic, patterns over
Eurasia, and patterns over the North Pacific/North America. The prominent patterns over
the North Atlantic are: the North Atlantic Oscillation (NAO), the East Atlantic Pattern
(EA), and the East Atlantic Jet Pattern (EA-JET). The prominent patterns over Eurasia
are: the East Atlantic/West Russia Pattern (EA/WR), the Scandinavian Pattern (SCAND),
the Polar/Eurasia Pattern (POL) and the Asian Summer Pattern (ASU). The prominent
patterns over the North Pacific/North America are: the West Pacific Pattern (WP), the East
Pacific Pattern (EP), the North Pacific Pattern (NP), the Pacific/North American Pattern
(PNA), the Tropical/Northern Hemisphere Pattern (TNH), and the Pacific Transition
Pattern  (PT). 

The North Atlantic Oscillation (NAO), shown in Figure 1, is one of the dominant
modes of Northern Hemispheric climate variability (Walker and Bliss, 1932; Van Loon
and Rogers, 1978; Wallace and Gutzler, 1981; Washington et al., 2000) and is a leading
mode in all months (Barnston and Livezey, 1987; Washington et al., 2000). The NAO
exhibits  little  variation  in  its  climatological  mean  structure  from  month-to-month,  and 
consists  of  a  north-south  dipole  of  anomalies,  with  one  center  over  the  Greenland/Iceland 
region  and  the  other  center,  of  opposite  sign,  spanning  the  central  latitudes  of  the  North 
Atlantic  around  the  Azores  between  35°N  and  40°N.  The  positive  phase  of  the  NAO 
reflects  below-normal  heights  and  pressure  across  the  high  latitudes  of  the  North  Atlantic 
and  above-normal  heights  and  pressure  over  the  central  North  Atlantic,  the  eastern  United 
States  and  Western  Europe.  The  negative  phase  reflects  an  opposite  dipole  pattern  of 
height  and  pressure  anomalies  over  these  regions  (Washington  et  al.,  2000;  CPC,  2001). 
Strong  positive  phases  of  the  NAO  tend  to  be  associated  with  above-normal  temperatures 
in the eastern United States and across northern Europe and with below-normal
temperatures  in  Greenland  and  oftentimes  across  southern  Europe  and  the  Middle  East 
(CPC,  2001). 

The  EA  pattern,  shown  in  Figure  2,  is  a  prominent  mode  of  low-frequency 
variability  over  the  North  Atlantic.  It  is  a  prominent  mode  in  all  months  except  May- 
August.  It  consists  of  a  north-south  dipole  of  anomaly  centers,  which  span  the  entire 
North  Atlantic  Ocean  from  east  to  west  with  the  zero  line  always  positioned  over  England 
or  France.  The  EA  pattern  is  structurally  similar  to  the  NAO  pattern;  however,  the 
anomaly  centers  are  displaced  southeastward  to  the  approximate  nodal  lines  of  the  NAO 
pattern. The lower-latitude center contains a strong subtropical link, reflecting
large-scale modulation in the strength and location of the subtropical ridge (CPC, 2001).


[Figure 1 panels: NORTH ATLANTIC OSCILLATION (NAO); January, April; scale -75 to 75]

Figure 1. Phases of the NAO pattern. From positive phase in January to negative phase
in July. Values are scaled to be correlations between the average 700-mb height
anomalies at a given grid point and the principal component amplitude (modified from
CPC, 2001).

Figure 2. Phases of the EA pattern (modified from CPC, 2001).

The  EA-JET  pattern,  shown  in  Figure  3,  is  a  prominent  mode  of  North  Atlantic 
variability,  appearing  between  April  and  August.  This  pattern  also  consists  of  a  north- 
south  dipole  of  anomaly  centers,  with  one  main  center  located  over  the  high  latitudes  of 
the  eastern  North  Atlantic  and  Scandinavia,  and  the  other  center  located  over  Northern 
Africa  and  the  Mediterranean  Sea.  A  positive  phase  of  the  EA-Jet  pattern  reflects  an 
intensification  of  the  westerlies  over  the  central  latitudes  of  the  eastern  North  Atlantic 
and over much of Europe, while a negative phase reflects a strong split-flow
configuration over these regions, sometimes in association with long-lived blocking
anticyclones in the vicinity of Greenland and Great Britain (CPC, 2001).


[Figure 3 panels: EAST ATLANTIC JET (EA-JET); April, July, October]

Figure 3. Phases of the EA-JET pattern (modified from CPC, 2001).


The  EA/WR  pattern,  shown  in  Figure  4,  is  one  of  two  prominent  modes  that 
affect  Eurasia  during  most  of  the  year.  This  pattern  is  prominent  in  all  months  except 
June-August. In winter, two main anomaly centers, located over the Caspian Sea and
Western Europe, comprise the East Atlantic/West Russia pattern. A three-celled
pattern  is  then  evident  in  the  spring  and  fall  seasons,  with  two  main  anomaly  centers 
of  opposite  sign  located  over  western/northwestern  Russia  and  over  northwestern 
Europe.  The  third  center,  having  the  same  sign  as  the  Russia  center,  is  located  off  the 
Portuguese  coast  in  spring,  but  exhibits  a  northern  movement  toward  Newfoundland 
in  the  fall  (CPC,  2001). 


[Figure 4 panels: EAST ATLANTIC/WEST RUSSIA (EATL/WRUS); January, April]

Figure 4. Phases of the EA/WR pattern (modified from CPC, 2001).

The SCAND pattern, shown in Figure 5, consists of a primary circulation
center,  which  spans  Scandinavia  and  large  portions  of  the  Arctic  Ocean  north  of 
Siberia.  Two  additional  weaker  centers  with  opposite  sign  to  the  Scandinavia  center 
are  located  over  Western  Europe  and  over  the  Mongolia  and  the  western  China  sector. 
The  positive  phase  of  this  pattern  is  associated  with  positive  height  anomalies, 
sometimes  reflecting  major  blocking  anticyclones  over  Scandinavia  and  western 
Russia,  while  the  negative  phase  of  the  pattern  is  associated  with  negative  height 
anomalies  over  these  same  regions  (CPC,  2001). 


[Figure 5 panels: SCANDINAVIA (SCAND); January, April]

Figure 5. Phases of SCAND pattern (modified from CPC, 2001).

The POL pattern, shown in Figure 6, appears only in the winter, and is the most
prominent  mode  of  low-frequency  variability  during  December  and  February.  The 
pattern  consists  of  one  main  anomaly  center  over  the  polar  region,  and  separate  centers  of 
opposite  sign  to  the  polar  anomaly  over  Europe  and  northeastern  China.  Thus,  the  pattern 
reflects  major  changes  in  the  strength  of  the  circumpolar  circulation,  and  reveals  the 
accompanying  systematic  changes  that  occur  in  the  midlatitude  circulation  over  large 
portions  of  Europe  and  Asia  (CPC,  2001). 

[Figure 6 panel: POLAR/EURASIAN PATTERN; January]

Figure 6. The POL pattern (modified from CPC, 2001).

The  ASU  pattern,  shown  in  Figure  7,  is  a  broad,  east-west  center  in  central  Asia 
(Barnston and Livezey, 1987). The Asian Summer pattern is only a leading mode during
the summer months of June-August. The pattern is monopole in nature with anomalies of
the same sign observed throughout southern Asia and northeastern Africa. A positive
phase of the pattern is indicated by above-normal heights throughout southern Asia and
northeastern Africa (CPC, 2001). The above-normal heights are thought to be due to the
intense heating over the Tibetan Plateau. It is theorized that in years with higher amounts
of insolation over the plateau, the entire ITCZ over Africa and Asia is pulled further north,
thus affecting the circulation over the entire Asian continent (Lowther, 1998).


[Figure 7 panel: ASIAN SUMMER; July]

Figure 7. Positive phase of ASU pattern (modified from CPC, 2001).

The  WP  pattern,  shown  in  Figure  8,  is  a  primary  mode  of  low-frequency 
variability over the North Pacific throughout all months (Washington et al., 2000;
Barnston  and  Livezey,  1987;  Wallace  and  Gutzler,  1981).  During  winter  and  spring,  the 
pattern  consists  of  a  north-south  dipole  of  anomalies,  with  one  center  located  over  the 
Kamchatka  Peninsula  and  another  broad  center  of  opposite  sign  covering  portions  of 
southeastern  Asia  and  the  lower  latitudes  of  the  extreme  western  North  Pacific.  Strong 
positive or negative phases of this pattern reflect pronounced zonal and meridional

Figure 8. Phases of the WP pattern (modified from CPC, 2001).


The  EP  pattern,  shown  in  Figure  9,  is  evident  in  all  months  except  August  and 
September  and  reflects  a  north-south  dipole  of  height  anomalies  over  the  eastern  North 
Pacific.  The  northern  center  is  located  in  the  vicinity  of  Alaska  and  the  west  coast  of 
Canada,  while  the  southern  center  is  of  an  opposite  sign  and  is  found  near,  or  east  of, 
Hawaii. During strong positive phases of the EP pattern, a deeper-than-normal trough is
located  in  the  vicinity  of  the  Gulf  of  Alaska  or  western  North  America,  and  positive 
height  anomalies  are  observed  further  south.  A  strong  negative  phase  of  the  EP  pattern  is 
associated  with  a  pronounced  split-flow  configuration  over  the  eastern  North  Pacific, 
with  reduced  westerlies  over  the  region  (CPC,  2001). 


[Figure 9 panel: EAST PACIFIC PATTERN (EP)]

Figure 9. Phase of the EP pattern (modified from CPC, 2001).

The NP pattern, shown in Figure 10, is prominent from March through July. This
pattern  consists  of  a  primary  anomaly  center,  which  spans  the  central  latitudes  of  the 
western  and  central  North  Pacific,  and  a  weaker  anomaly  region  of  opposite  sign,  which 
spans  eastern  Siberia,  Alaska,  and  the  western  mountain  regions  of  North  America. 
Overall,  pronounced  positive  phases  of  the  NP  pattern  are  associated  with  a  southward 
shift  and  intensification  of  the  Pacific  jet  stream  from  eastern  Asia  to  the  eastern  North 
Pacific,  followed  downstream  by  an  enhanced  anticyclonic  circulation  over  western 
North  America,  and  by  an  enhanced  cyclonic  circulation  over  the  southeastern  United 
States.  Pronounced  negative  phases  of  the  NP  pattern  are  associated  with  circulation 
anomalies  of  opposite  sign  in  these  same  regions  (CPC,  2001). 


[Figure 10 panel: April]

Figure 10. Phases of NP pattern (modified from CPC, 2001).

The PNA pattern, shown in Figure 11, is perhaps the best-known mode of
Pacific-based variability. It appears in all months except June and July. The PNA pattern
reflects  a  quadripole  pattern  of  geopotential  anomalies,  with  anomalies  of  similar  sign 
located  south  of  the  Aleutian  Islands  and  over  the  southeastern  USA.  Anomalies  with 
signs  opposite  to  the  Aleutian  center  are  located  near  Hawaii  and  over  central  Canada 
during the winter and autumn (CPC, 2001; Washington et al., 2000; Barnston and
Livezey,  1987). 


[Figure 11 panel: PACIFIC/NORTH AMERICAN PATTERN (PNA)]

Figure 11. Phases of PNA pattern (modified from CPC, 2001).

The TNH pattern, shown in Figure 12, appears as a prominent mode from
November-February.  The  pattern  consists  of  one  primary  anomaly  center  over  the  Gulf 
of  Alaska  and  a  separate  anomaly  center  of  opposite  sign  over  Hudson  Bay.  A  weaker 
area  of  anomalies  having  the  same  sign  to  the  Gulf  of  Alaska  anomaly  extends  across 
Mexico  and  the  extreme  southeastern  United  States.  This  pattern  reflects  large-scale 
changes  in  both  the  location  and  eastward  extent  of  the  Pacific  jet  stream,  and  also  in  the 
mean  strength  and  position  of  the  climatological  Hudson  Bay  low.  This  pattern 
significantly  modulates  the  flow  of  marine  air  into  North  America,  as  well  as  the 
southward transport of cold Canadian air into the north-central U.S. (CPC, 2001).

[Figure 12 panel: TROPICAL/NORTHERN HEMISPHERE PATTERN]

Figure 12. Phases of TNH pattern (modified by CPC, 2001).

The  PT  pattern,  shown  in  Figure  13,  is  prominent  between  May-August.  The 
mode  consists  of  a  pattern  of  height  anomalies,  which  extends  from  the  Gulf  of  Alaska 
eastward to the Labrador Sea and is aligned along the 40°N latitude circle. The
prominent  centers  of  action  have  a  similar  sign  and  are  located  over  the  intermountain 
region  of  the  United  States  and  over  the  Labrador  Sea.  Relatively  weak  anomaly  centers 
with  signs  opposite  to  the  above  are  located  over  the  Gulf  of  Alaska  and  over  the  eastern 
United  States  (CPC,  2001). 


[Figure 13 panel: PACIFIC TRANSITION PATTERN]

Figure 13. Phases of PT pattern (modified by CPC, 2001).

Southern  Hemispheric  Teleconnection  Pattern 

“When  the  pressure  is  high  in  the  Pacific  Ocean,  it  tends  to  be  low  in  the  Indian 
Ocean  from  Africa  to  Australia.”  This  is  how  Sir  Gilbert  Walker  described  the  Southern 
Oscillation  (SO)  in  his  papers  in  the  1920s  and  1930s  (Burroughs,  1992).  There  are 
numerous ways of recording this slow see-saw of atmospheric pressure across the
equatorial Pacific, resulting in various Southern Oscillation Indices (SOI). This research
uses the SOI created from the pressure difference between Tahiti, French Polynesia, in
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the  mid-Pacific  and  Darwin  in  northern  Australia.  These  two  stations  represent  the 
Southeast  Pacific  area  of  high  pressure  and  the  Indonesian  low,  respectively  (Robinson 
and  Henderson-Sellers,  1999). 

Other  research 

There are numerous articles that draw comparisons between a specific
teleconnection pattern and specific meteorological parameters, but there are fewer articles
that use all of the teleconnections together for a comparison toward single parameters.
Of those that use multiple teleconnections (Washington et al., 2000; Rodionov and
Assel, 2000, for example), none attempted to create predictive relationships between the
teleconnections and the parameters that would result in a model usable as a tool in the
operational field.
III. Data Collection and Review


Northern  Hemisphere  Teleconnection  Pattern  Indices 

As  mentioned  in  Chapter  II,  the  10  most  prominent  teleconnection  patterns  in 
each  month  were  calculated  by  RPCA  (CPC,  2001).  For  each  of  the  10  patterns  in  a 
month,  CPC  calculates  a  monthly  index.  This  method  of  calculation  is  a  form  of  factor 
analysis  that  has  not  yet  been  published  by  CPC. 

Southern  Oscillation  Index 

The  Southern  Oscillation  Index  (SOI)  is  the  only  index  used  in  this  study  that  is 
not  calculated  by  the  RPCA  method.  It  is  calculated  by  using  the  raw  atmospheric 
pressure  data  from  Tahiti  and  Darwin,  Australia.  The  anomalies  used  are  departures  from 
the  1951-1980  base  period,  and  the  anomaly  for  each  city  is  defined  as: 

$$XA = \mathrm{Actual}(SLP) - \mathrm{Mean}(SLP) \qquad (1)$$

where XA is either TA for the Tahiti anomaly or DA for the Darwin anomaly,
depending on which city’s anomaly is being calculated, and SLP is the sea level
pressure for the appropriate location. The standard deviation for Tahiti or Darwin is:


$$\text{Standard Deviation} = \sqrt{\frac{\sum (XA)^2}{N}} \qquad (2)$$


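where N is the number of months being summed. The data are then standardized as
follows:

$$ST = \frac{TA}{\text{Standard Deviation (Tahiti)}}, \qquad SD = \frac{DA}{\text{Standard Deviation (Darwin)}} \qquad (3)$$

where ST is standardized Tahiti and SD is standardized Darwin monthly data.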
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The  standardized  SOI,  is  then: 


$$SOI = \frac{ST - SD}{\sqrt{\dfrac{\sum (ST - SD)^2}{N}}} \qquad (4)$$


where  the  denominator  is  the  monthly  standard  deviation. 
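The standardization steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and arguments are hypothetical, a single base-period mean per station is passed in (the study takes departures from the 1951-1980 base period), and the standard deviations are computed over whatever months are supplied.

```python
import math

def standardized_soi(tahiti_slp, darwin_slp, tahiti_base_mean, darwin_base_mean):
    """Return standardized SOI values for paired monthly SLP series.

    Hypothetical helper: base-period means are supplied by the caller.
    """
    # Anomalies: actual SLP minus the base-period mean SLP
    ta = [p - tahiti_base_mean for p in tahiti_slp]
    da = [p - darwin_base_mean for p in darwin_slp]
    n = len(ta)

    # Standard deviation of each station's anomalies over the N months
    sd_tahiti = math.sqrt(sum(a * a for a in ta) / n)
    sd_darwin = math.sqrt(sum(a * a for a in da) / n)

    # Standardized Tahiti (ST) and standardized Darwin (SD) series
    st = [a / sd_tahiti for a in ta]
    sd = [a / sd_darwin for a in da]

    # Divide the difference by the monthly standard deviation
    diff = [s - d for s, d in zip(st, sd)]
    monthly_sd = math.sqrt(sum(x * x for x in diff) / n)
    return [x / monthly_sd for x in diff]


# Made-up pressures (hPa), for illustration only
soi = standardized_soi([1012.0, 1014.0, 1010.0, 1013.0],
                       [1009.0, 1008.0, 1011.0, 1010.0],
                       1012.25, 1009.5)
```

By construction the returned series has a root-mean-square value of one, which is what dividing by the monthly standard deviation is meant to achieve.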

Heating  Degree  Days  /  Cooling  Degree  Days 

To calculate the HDDs for a particular day, one would first find the day’s average
temperature. The day’s average temperature for the data used in this study is found by:

$$\frac{Max\_temp + Min\_temp}{2}$$

where Max_temp is the day’s maximum temperature and Min_temp is the day’s
minimum temperature. If the average temperature is less than 65°F, subtract the
average temperature from 65°F and the result is the number of HDDs for that
particular day. The resulting number is accumulated over a month, season, or
whatever period is being examined. To calculate CDDs for a particular day, one would
again find the day’s average temperature. If the temperature is greater than 65°F, subtract
65°F from the average temperature and the result is the number of CDDs for that day.
The number is again accumulated over the period in question.
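The degree-day rule described above can be written as a short Python sketch. The function names and the sample temperatures are illustrative assumptions, not part of the original study.

```python
BASE_TEMP_F = 65.0  # base temperature used for both HDDs and CDDs

def daily_degree_days(max_temp, min_temp, base=BASE_TEMP_F):
    """Return (HDD, CDD) for one day from its max and min temperature (deg F)."""
    avg = (max_temp + min_temp) / 2.0
    hdd = max(base - avg, 0.0)  # heating degree days when the average is below base
    cdd = max(avg - base, 0.0)  # cooling degree days when the average is above base
    return hdd, cdd

def accumulate_degree_days(days):
    """Sum HDDs and CDDs over an iterable of (max_temp, min_temp) pairs."""
    total_hdd = 0.0
    total_cdd = 0.0
    for max_temp, min_temp in days:
        hdd, cdd = daily_degree_days(max_temp, min_temp)
        total_hdd += hdd
        total_cdd += cdd
    return total_hdd, total_cdd


# Three made-up days: one cold, one hot, one averaging exactly 65 deg F
totals = accumulate_degree_days([(50, 30), (90, 70), (70, 60)])
```

A day averaging exactly 65°F contributes to neither total, which matches the strict "less than" and "greater than" wording above.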

Locations 

The  HDDs  and  CDDs  were  calculated  for  14  locations  across  the  U.S.,  shown  in 
Figure  14.  The  locations  have  a  good  history  of  temperature  data  and  make  an  excellent 
database  for  this  study.  The  locations  are:  Atlanta-Hartsfield  International  Airport, 
Georgia;  Chicago  O’Hare  International  Airport,  Illinois;  Cincinnati-Northern  Kentucky 
Airport,  Kentucky;  Dallas-Fort  Worth  International  Airport,  Texas;  Des  Moines 
International  Airport,  Iowa;  Las  Vegas  McCarran  International  Airport,  Nevada; 
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Memphis  International  Airport,  Tennessee;  Minneapolis-St.  Paul  International  Airport, 
Minnesota; New York LaGuardia Airport, New York; Philadelphia International Airport,
Pennsylvania;  Portland  International  Airport,  Oregon;  Sacramento  Executive  Airport, 
California;  Tucson  International  Airport,  Arizona;  and  Wright-Patterson  AFB,  Ohio. 
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Figure  14.  Fourteen  U.S.  locations  from  which  HDDs  and  CDDs  are  calculated 
(modified  from  Mapquest.com,  2001). 




IV.  Linear  Regression  Analysis 


Data  Manipulation  for  Linear  Regression  Analysis 

This  study  began  with  simple  regression  analysis  between  the  HDDs  and  CDDs 
for  the  14  locations  and  all  13  teleconnection  indices.  The  goal  was  to  compare  the  13 
teleconnection  indices  with  the  HDDs  and  CDDs  of  the  14  locations  for  one,  two  and 
three  months  in  the  future.  All  HDDs,  CDDs,  and  teleconnections  were  put  into  data 
vector  columns,  temporally  from  January  1950  to  December  1999,  for  13  of  the  14 
locations.  Chicago’s  data  started  in  1959,  therefore  Chicago  data  manipulations  were 
accomplished  from  this  date  forward.  The  vector  format  used  was  necessary  for  the 
statistical  program  to  properly  accomplish  regression  analysis,  but  created  missing  data 
problems. The monthly teleconnection indices are created only in months when the
teleconnections are an RPCA leading mode (in the top ten). Except for the NAO and the
standardized SOI (SOI_S), none of the teleconnections is a leading mode in every month of
the year, and the statistical program will not use a row of the combined columns if any
value in that row is missing.

To  correct  these  problems  12  different  matrices  were  created,  one  for  each  month 
of  the  year,  and  only  those  teleconnection  indices  that  were  RPCA  leading  modes  in  the 
specific month were added to the matrix. Needing to compare all twelve months of
teleconnections with each location’s HDDs and CDDs for one, two, and three months in
the future significantly increased the needed analysis time. Therefore, due to time
constraints, the data were combined to create seasonal values.
The months were combined to create seasons and then the seasons were separated into
categories, shown in Table 1.


Table  1.  Monthly  periods  used  in  summations  of  HDDs  and  CDDs  to  create  seasons. 


Winter (HDDs)                     Summer (CDDs)
October-December     OND          April-June            AMJ
November-January     NDJ          May-July              MJJ
December-February    DJF          June-August           JJA
January-March        JFM          July-September        JAS
February-April       FMA          August-October        ASO
March-May            MAM          September-November    SON

HDDs  were  summed  into  three-month  seasons  from  October-May  and  CDDs  were 
summed  into  three-month  seasons  from  April-November.  This  process  decreased  the 
number of needed comparisons. The goal, at this point, was to compare the
teleconnection indices in RPCA leading modes in a particular month with the summation
of the next three months’ HDDs or CDDs, depending on the month being compared, for
each location. Before these comparisons were completed, 10 years of data were
randomly selected and removed to create an independent verification database.
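The seasonal summation and the random withholding of verification years can be sketched as follows. This is an illustrative assumption of how the bookkeeping might be done: the data layout (a dictionary keyed by year and month), the function names, and the random seed are all hypothetical.

```python
import random

def three_month_sum(monthly, year, start_month):
    """Sum three consecutive months starting at (year, start_month).

    `monthly` is a hypothetical dict {(year, month): degree days}; seasons
    such as NDJ or DJF cross the year boundary, which is handled here.
    """
    total = 0.0
    for offset in range(3):
        month_index = start_month + offset
        y = year + (month_index - 1) // 12
        m = (month_index - 1) % 12 + 1
        total += monthly[(y, m)]
    return total

def split_holdout(years, n_holdout=10, seed=0):
    """Randomly withhold n_holdout years as independent verification data."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    holdout = set(rng.sample(list(years), n_holdout))
    training = [y for y in years if y not in holdout]
    return training, sorted(holdout)


# Made-up HDD values for a December-February (DJF) season
monthly_hdd = {(1950, 12): 100.0, (1951, 1): 200.0, (1951, 2): 150.0}
djf_total = three_month_sum(monthly_hdd, 1950, 12)
training_years, holdout_years = split_holdout(range(1950, 2000))
```

With 50 years of record (1950-1999), withholding 10 years leaves 40 years for model development, which is consistent with the n = 40 cases reported in the regression and tree output.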

Linear  Regression  Analysis 

Linear regression was accomplished on the data using leading mode
teleconnections from May and the summed CDDs from June-August. The first output
statistic taken into consideration was the value in the significance (Sig.) column of the
analysis of variance (ANOVA) table, shown in Table 2. This is commonly known as the
p-value in statistical references and can be compared to the significance level of 0.01. If
the p-value is less than 0.01, then at least one of the predictors (teleconnection indices)
has a statistically significant relationship with the dependent variable (HDDs or CDDs)
at the 0.01 significance level.


Table 2. ANOVA table output from linear regression. A p-value in the Sig. column of
less than 0.01 indicates a good model.

ANOVA

Model         Sum of Squares    df    Mean Square    F        Sig.
Regression    679422.475        10    67942.247      3.825    .002
Residual      515055.500        29    17760.534
Total                           39

a Predictors: (Constant), SOI_S, PNA, EAWR, PT, NAO, SCA, EA_JET, NP, EP, WP
b Dependent Variable: NUM_AT


The value in the Sig. column of the coefficients table, shown in Table 3, is used to
evaluate which predictors were statistically sound. Those with a p-value greater than 0.05
were eliminated from the model and linear regression was rerun. This procedure was
repeated until the best model was obtained. Ideally, an ANOVA p-value of less than or
equal to 0.01, with p-values of the predictors in the coefficients table of less than or equal
to 0.01, results in the best model; however, it was not always possible to reach this goal.
While running the linear regression, the adjusted R-squared parameter in the model
summary table, shown in Table 4, was also considered.
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Table  3.  Coefficients  table  output  from  linear  regression.  A  p-value  of  0.01  in  the  Sig 
column  is  desired  for  a  significant  model.  The  predictors  with  the  greatest  p-value  were 

eliminated  and  the  analysis  was  run  again. 


Coefficients

              Unstandardized Coefficients    Standardized
Model         B           Std. Error         Beta      t         Sig.
(Constant)    1187.300    23.599                       50.311    .000
NAO           -12.627     23.333             -.069     -.541     .593
EA_JET        -.907       25.697             -.005     -.035     .972
WP            91.944      25.580             .531      3.594     .001
EP            36.439      28.100             .174      1.297     .205
NP            56.544      22.089             .367      2.560     .016
PNA           -25.205     23.185             -.138     -1.087    .286
EAWR          -67.043     25.475             -.335     -2.632    .013
SCA           -22.762     20.866             -.139     -1.091    .284
PT            -17.405     25.303             -.098     -.688     .497
SOI_S         43.276      30.210             .221      1.432     .163

a Dependent Variable: NUM_AT


Table 4. Model summary table output from linear regression. An Adjusted R-squared
greater than 0.60 is desired.

Model Summary

Model    R          R Square    Adjusted R Square    Std. Error of the Estimate
1        .754(a)    .569        .420                 133.2687

a Predictors: (Constant), SOI_S, PNA, EAWR, PT, NAO, SCA, EA_JET, NP, EP, WP


The R-squared value can be interpreted as the proportion of the variation of the
predictand that is “described” or “accounted for” by the regression (Wilks, 1995). The
R-squared is adjusted for the number of predictors in the model, creating the adjusted
R-squared coefficient. An adjusted R-squared of 0.60 (describing 60% of the predictand
variance) or greater was the goal if any predictive model were to be discovered. As one
can see from Table 5, the greatest adjusted R-squared, for just one location, was 43%.
The rest of the results using May teleconnections versus June-August CDDs are also
listed in Table 5. The results are not conducive to a predictive model, so data mining
techniques were used for further exploration.
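As a check on how the regression statistics relate to one another, the short Python sketch below recomputes R-squared, adjusted R-squared, and the F statistic from the sums of squares in Table 2. The function name is hypothetical; the formulas are the standard textbook definitions, and the inputs are the Atlanta values shown in Table 2 (40 cases, 10 predictors).

```python
def regression_summary(ss_regression, ss_residual, n_cases, n_predictors):
    """Return (R-squared, adjusted R-squared, F) from ANOVA sums of squares."""
    ss_total = ss_regression + ss_residual
    r_squared = ss_regression / ss_total

    df_regression = n_predictors
    df_residual = n_cases - n_predictors - 1

    # Adjusted R-squared penalizes the fit for the number of predictors used
    adj_r_squared = 1.0 - (1.0 - r_squared) * (n_cases - 1) / df_residual

    # F is the ratio of the regression and residual mean squares
    f_stat = (ss_regression / df_regression) / (ss_residual / df_residual)
    return r_squared, adj_r_squared, f_stat


r2, adj_r2, f_stat = regression_summary(679422.475, 515055.500, 40, 10)
print(round(r2, 3), round(adj_r2, 3), round(f_stat, 3))
```

The recomputed values agree with the R Square (.569), Adjusted R Square (.420), and F (3.825) entries reported in Tables 2 and 4, and they show how ten predictors on 40 cases pull the adjusted value well below the raw R-squared.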


Table 5. P-value and adjusted R-squared from the ANOVA table for the 14 locations.
Linear regression used May teleconnections and June-August CDDs.

City      ATL        CHI      CIN      DFW      DM       LV       MEM
p-val     <0.0001    0.085    0.005    0.001    0.110    0.002    0.002
Adj R²    0.426      0.119    0.131    0.240    0.329    0.097    0.310

City      MIN        NYL      PHI      POR      SAC      TUC      WPAFB
p-val     0.005      0.139    0.001    0.002    0.002    0.016    0.062
Adj R²    0.289      0.285    0.053    0.314    0.290    0.205    0.114
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V.  Tree-Based  Statistical  Models 


Overview 

Tree-based  statistical  models  are  a  recent  development  in  statistics  that  have  been 
applied  to  prediction  problems  in  widely  diverse  fields  of  endeavor,  but  are,  as  of  yet,  not 
well  known  in  the  atmospheric  sciences  (Burrows  and  Assel,  1992).  This  study  uses 
classification  and  regression  trees  (CART)  analysis  to  explore  the  data.  CART  is  a  tree- 
based  statistical  procedure  for  application  to  classification  and  regression  problems. 
Breiman  et  al.  (1984)  found  that  error  rates  of  CART  solutions  are  nearly  always  as  low 
or lower than solutions by linear regression. Error rates are also significantly lower for
problems involving complex predictands and many predictors (Burrows and Assel,
1992).

From a database of predictand cases and accompanying predictors, CART
establishes decision trees that are a classification of categorical predictands or a
regression of continuous predictands. A decision tree consists of a tree-like structure of
binary decision rules. At each decision point (node) a case will branch either to the left
or right based on a test against a specific predictor value, and will continue branching
until a final point (terminal node) is reached. CART uses input parameters of tree length,
parent node size, and child node size to determine the number of nodes. It uses the inputs
to search for the tree that provides the least error when used with independent data. In
this study, the independent data are represented by the ten years of data that were withheld
from the original dataset. After a tree is calculated, a process of eliminating terminal
nodes (pruning) is accomplished to make the tree a more effective model. Categorical
predictands are used in classification tree analysis and continuous predictands are used in
regression tree analysis (Burrows and Assel, 1992).

The  goal  of  this  study  at  this  point  was  to  produce  a  predictive  tool,  using  CART 
analysis,  for  seasonal  HDDs  or  CDDs  that  would  be  more  accurate  than  using  the 
climatological  normals  or  simple  frequency  distributions  of  occurrences. 

Classification  Tree  Analysis 

Classification trees were the first tree-based models attempted. To use this model
the data had to be categorized into a nominal data format. Data were categorized into
thirds, using categories of above normal, normal, and below normal. Each HDD and
CDD vector was sorted into ascending order, then the separation values between the
upper third, the middle third, and the lower third were calculated. All data between the
calculated values in each vector were assigned to the corresponding above normal,
normal, or below normal category.
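The categorization into thirds can be sketched as below. This is a minimal illustration, not the study's procedure: the function name is hypothetical, and the way boundary values and ties are assigned is an assumption (the study's own category counts are not exactly equal, so its boundary handling evidently differed in detail).

```python
def tercile_categories(values):
    """Label each value 0 (below normal), 1 (normal), or 2 (above normal).

    Assumed convention: the lowest third and highest third of the sorted
    values define the cut points; everything in between is 'normal'.
    Requires at least three values.
    """
    ordered = sorted(values)
    n = len(ordered)
    lower_cut = ordered[n // 3 - 1]   # largest value in the lower third
    upper_cut = ordered[n - n // 3]   # smallest value in the upper third

    categories = []
    for v in values:
        if v <= lower_cut:
            categories.append(0)
        elif v >= upper_cut:
            categories.append(2)
        else:
            categories.append(1)
    return categories


# Made-up seasonal CDD totals for nine years
cats = tercile_categories([50, 10, 90, 30, 70, 20, 80, 40, 60])
```

The category codes match those used in the classification tree figure, where 2 is above normal, 1 is normal, and 0 is below normal.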

An example of such a tree is shown in Figure 15. A brief explanation of this
classification tree provides the reader a general idea of the tree’s structure. This tree was
computed using data from Minneapolis, Minnesota, using May teleconnections and
categorized June-August CDDs. Specific “parent” and “child” node inputs are user
provided. In this tree the parent node of any split must have at least n=6, n being the
number of data points (years) in the node, and the child node must be at least n=2. If
these conditions are not met, the node will stop splitting. For example, node 5 has n=6,
but the program calculated that if this node was split, one of the resulting splits would not
be at least n=2. The split was therefore stopped. However, in node 4, with n>6, a split
was accomplished because the child nodes were both at least n=2.
To reach a specific node, a series of conditions must be true. For example, to get to node
10 the EP must be less than or equal to 0.45, the EAWR must be greater than or equal to
-0.2, and EA_JET must be greater than -0.35.


[Classification tree diagram for Minneapolis (MIN). Legible nodes, listed as counts in
categories 0 / 1 / 2:
Node 0: 13 / 15 / 12, n=40 (100%); split on EP at 0.45.
Node 1 (EP <= 0.45): 13 / 8 / 6, n=27 (67.5%); split on EAWR at -0.2.
Node 2 (EP > 0.45): 0 / 7 / 6, n=13 (32.5%); split on NP at -0.15.
Node 3 (EAWR <= -0.2): 2 / 5 / 4, n=11 (27.5%).
Node 4 (EAWR > -0.2): 11 / 3 / 2, n=16 (40%).
Node 5 (NP <= -0.15): 0 / 1 / 5, n=6 (15%).
Node 6 (NP > -0.15): 0 / 6 / 1, n=7 (17.5%).
Lower levels of the tree are not legible in the source.]


Figure 15. Example of a classification tree. This example is a tree run with data from
Minneapolis using May teleconnections and June-August CDDs. Three categories are
present; 2 is above normal, 1 is normal, and 0 is below normal.
Extracting any predictive information proved difficult in the classification trees. For
example, node 9 shows nine remaining data points from the original 13 in the below
normal category (69%), with nine of the 10 data points in that node (90%). The product
of the two gives a total probability of 62% of a below normal case ending in node 9,
which is not a bad result. However, the best probabilities calculated from the trees were
in the mid 60% range and only in a few nodes. In addition, specific conditions needed to
exist to arrive in the nodes. In this example, node 9 only incorporates 25% of the total
data. This result created difficulty in creating a tool that would efficiently incorporate
the whole dataset. It did not appear there was any likelihood of creating any useful
predictive tools from classification trees, so a different form of CART analysis was
accomplished.

Regression  Tree  Analysis 

The regression tree differs from the classification tree in that it uses continuous
data instead of classified nominal data. Figure 16 is an example of such a regression tree.
A brief explanation of this example tree will give the reader a general idea of the tree’s
structure. This tree was computed with the same data from Minneapolis, Minnesota,
using May teleconnections and June-August CDDs. The user inputs three initial
constraints before a tree can be grown. The inputs are the maximum number of levels,
the minimum number of data points necessary in the parent node before a split can be
performed, and the minimum number of data points in the child node before a split can be
performed. In the example shown in Figure 16, the input values are 10 maximum levels
of the tree, 6 minimum data points in the parent node, and a minimum of 2 data points in
a child node. The regression tree starts with a beginning node, node 0. In this example,
node 0 represents the summed CDDs from June-August for Minneapolis. It displays the
mean, standard deviation, number of data points, and the percent of data that is in that
particular node. Each parent node is split into two child nodes until the splitting is
stopped by user specified inputs.


[Regression tree diagram for Minneapolis (dependent variable NUM_MIN). Legible
portion: splits on EAWR and PNA, including a PNA split at -0.95
(improvement = 2314.8043) that produces
Node 7: mean 372.5000, std. dev. 119.2882, n=4 (10.00%), predicted 372.5000;
Node 8: mean 543.7333, std. dev. 83.1217, n=15 (37.50%), predicted 543.7333.
The remainder of the diagram is not legible in the source.]

Figure  16.  Example  of  a  regression  tree.  Shown  is  a  tree  run  with  data  from  Minneapolis 
using  May  teleconnections  and  June-August  CDDs. 
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The  fundamental  idea  to  make  a  split  is  to  select  each  split  of  a  node  so  that  the  data  in 
each of the child nodes are “purer” than the data in the parent node (Breiman et al.,
1984). For continuous target variables, the least-squared deviation (LSD) impurity
measure is used. The LSD measure R(t) is the within-node variance for node t, and is
equal to the resubstitution estimate of risk for the node. It is defined as:

$$R(t) = \frac{1}{N(t)} \sum_{i \in t} \left( y_i - \bar{y}(t) \right)^2 \qquad (5)$$

where N(t) is the number of cases in the node t, y_i is the value of the target variable
(location HDDs or CDDs), and ȳ(t) is the mean for node t. The LSD criterion function for
split s at node t is defined as:

$$\phi(s,t) = R(t) - p_L R(t_L) - p_R R(t_R) \qquad (6)$$

where p_L is the proportion of cases in t sent to the left child node, p_R is the proportion
sent to the right child node, and t_L and t_R are the nodes created by the split s.

The  software  runs  all  possible  splits  on  the  node  and  splits  the  node  at  the  location  of  the 
largest  decrease  in  impurity.  This  value  is  shown  on  the  tree  as  the  “improvement”.  The 
process  is  then  repeated  at  each  node  (SPSS,  2001). 
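The split search described above can be illustrated with a small Python sketch for a single predictor. It is not the SPSS implementation: the function names are hypothetical, candidate thresholds are taken as midpoints between adjacent predictor values (an assumption), and the parent and child size limits mirror the user inputs discussed earlier (6 and 2).

```python
def lsd(values):
    """Least-squared deviation impurity: the within-node variance."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def best_split(x, y, min_parent=6, min_child=2):
    """Return (threshold, improvement) for the best split of y on predictor x.

    Returns None when the node is too small to split or no split leaves
    both children with at least min_child cases.
    """
    n = len(y)
    if n < min_parent:
        return None
    parent_risk = lsd(y)
    best = None
    levels = sorted(set(x))
    for lo, hi in zip(levels, levels[1:]):
        cut = (lo + hi) / 2.0  # assumed: midpoint between adjacent values
        left = [yi for xi, yi in zip(x, y) if xi <= cut]
        right = [yi for xi, yi in zip(x, y) if xi > cut]
        if len(left) < min_child or len(right) < min_child:
            continue
        # Decrease in impurity: R(t) - pL*R(tL) - pR*R(tR)
        improvement = (parent_risk
                       - len(left) / n * lsd(left)
                       - len(right) / n * lsd(right))
        if best is None or improvement > best[1]:
            best = (cut, improvement)
    return best


# Made-up index values and seasonal CDD totals for six years
index = [-1.0, -0.5, -0.2, 0.3, 0.6, 1.1]
cdds = [400.0, 410.0, 405.0, 600.0, 610.0, 605.0]
split = best_split(index, cdds)
```

In a full tree the same search is repeated over every predictor at every node, and the predictor and threshold with the largest improvement are used for the split.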

Application  of  Regression  Tree  Analysis 

The  goal  of  this  study  was  to  come  up  with  a  predictive  tool  for  HDDs/CDDs  using 
teleconnection  indices.  To  test  the  process  at  14  locations,  May  teleconnections  and 
June-August  CDDs  were  used.  First,  a  goodness  of  fit  test  for  normality  was 
accomplished  on  the  CDDs  in  each  city  using  the  Shapiro-Wilk  test.  The  results  are 
shown  in  Table  6. 
Table 6. Shapiro-Wilk goodness of fit test for normality. A p-value > 0.05 indicates a
normal distribution. DFW and TUC did not pass the test.


City        ATL     CHI     CIN     DFW     DM      LV      MEM
W-S test    0.99    0.29    0.26    0.01    0.42    0.55    0.31

City        MIN     NYL     PHI     POR     SAC     TUC      WPAFB
W-S test    0.84    0.22    0.69    0.25    0.93    0.004    0.12

A p-value > 0.05 in the Shapiro-Wilk test indicates a normal distribution. Dallas and
Tucson did not pass the normality test; however, only one data point for Tucson and two
for Dallas created the non-normal distribution, so the exploration for a predictive outcome
continued with normality assumed for all locations. With the goal of a predictive tool
that is better than the climatological normals or simple frequency distributions in mind,
it was decided to create a 95% prediction interval to define a range of CDDs. The mean
and standard deviation from each tree node were used to create a 95% prediction
interval, which is defined as:

$$\bar{x} \pm t_{0.025,\,n-1} \; s \sqrt{1 + \frac{1}{n}} \qquad (7)$$

where x̄ is the mean from the calculated node, t is the critical value for a t-distribution, n
is the number of data points in the node, and s is the sample standard deviation.
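A worked example of the prediction interval may help. The sketch below uses the values shown for node 11 of the pruned Minneapolis tree (mean 503.0, standard deviation 49.7654, n = 11). The function name is hypothetical, and the critical value 2.228 is the standard tabulated two-sided 95% t value for 10 degrees of freedom, supplied by hand here rather than computed.

```python
import math

def prediction_interval(mean, std_dev, n, t_crit):
    """Return (low, high) of the prediction interval about a node mean.

    t_crit is the t critical value for n - 1 degrees of freedom, looked up
    from a t table by the caller (an assumption of this sketch).
    """
    half_width = t_crit * std_dev * math.sqrt(1.0 + 1.0 / n)
    return mean - half_width, mean + half_width


low, high = prediction_interval(503.0, 49.7654, 11, 2.228)
print(round(low, 1), round(high, 1))
```

The resulting interval of roughly 387 to 619 CDDs lines up with the range of frequency-distribution classes assigned to node 11 in Table 8 (390 to 626), although that correspondence is an observation of this sketch rather than a statement from the source.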

The next step in tree-structured statistics is to prune the tree. Pruning consists of
eliminating the terminal nodes necessary to create the most effective tree. How to prune a
tree depends on the data being analyzed. The pruning criteria for the trees in this study
were calculated during the verification of the process created. The independent data (ten
years) withheld from the original data were run through the trees. The teleconnections
for each year were run through the tree to calculate which node was the terminating node,
then the specific year’s CDDs were checked to see if they fell within the created
prediction interval from the same node. During the verification process the data were run
through multiple models with different criteria for pruning the terminal nodes. It was
found that a node with n<6 needed to be pruned. As shown in Figure 17, those nodes of
the tree with n<6 are terminated.
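The verification step can be sketched as a walk through a tree followed by an interval check. Everything in this example is hypothetical: the nested-dictionary tree layout, the function names, the interval bounds, and the withheld cases are made up for illustration and are not values from the study.

```python
def find_terminal_node(tree, indices):
    """Follow the binary decision rules until a terminal node is reached."""
    node = tree
    while "split" in node:
        name, threshold = node["split"]
        # Cases at or below the threshold go left, the rest go right
        node = node["left"] if indices[name] <= threshold else node["right"]
    return node

def verification_rate(tree, holdout):
    """Percent of withheld cases whose observed value falls in the node interval."""
    hits = 0
    for indices, observed in holdout:
        low, high = find_terminal_node(tree, indices)["interval"]
        if low <= observed <= high:
            hits += 1
    return 100.0 * hits / len(holdout)


# Hypothetical one-split tree with made-up prediction intervals
toy_tree = {
    "split": ("EP", 0.15),
    "left": {"interval": (300.0, 700.0)},
    "right": {"interval": (450.0, 900.0)},
}
# Hypothetical withheld years: (teleconnection indices, observed CDDs)
toy_holdout = [
    ({"EP": -0.4}, 520.0),
    ({"EP": 0.8}, 430.0),
    ({"EP": 0.1}, 650.0),
    ({"EP": 1.2}, 700.0),
]
rate = verification_rate(toy_tree, toy_holdout)
```

Repeating this calculation for trees pruned under different minimum node sizes is one way to compare pruning criteria, which is the role the withheld years play in the text above.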

Results  of  model  verification  are  shown  in  Table  7.  Verification  results  for  the 
individual  locations  were  between  80%  and  100%  with  an  overall  88%  verification  rate. 


Table 7. Percentage of CDDs that were in the predicted range after verification data
were run through the trees. An overall verification rate of 88% was achieved.

City                     ATL    CHI     CIN    DFW    DM     LV     MEM
% CDDs in final node     80     87.5    80     100    90     80     80

City                     MIN    NYL     PHI    POR    SAC    TUC    WPAFB
% CDDs in final node     90     80      90     100    100    90     90

Results  vs.  Frequency  Distribution 

The goal of this study was to create a predictive tool that was better than the
climatological standard normals or simple frequency distributions of CDD/HDD
occurrences. Table 8 shows comparisons of the simple frequency distributions of
occurrences of Minneapolis June-August CDDs and the created 95% prediction interval
for valid nodes of the tree shown in Figure 17. The new calculated forecast ranges are
shaded in gray. The reduction of the original CDD range is quantified by calculating a
ratio of the new forecast range, for the individual nodes, to the original CDD range and
subtracting this value from one. This reduction percentage is multiplied by the
climatological frequency distribution for each individual node to obtain an expected
forecast range reduction percentage. The individual node expected forecast range
reduction percentages are summed to obtain a total expected range reduction. This
reduction percentage can be viewed as the total expected forecast range reduction over
climatology.

[Pruned regression tree diagram for Minneapolis (dependent variable NUM_MIN).
Legible nodes:
Node 0: mean 599.1000, std. dev. 150.6541, n=40 (100.00%); split on EP at 0.15
(improvement = 3997.6445).
Node 1 (EP <= 0.15): mean 541.9091, std. dev. 138.1944, n=22 (55.00%); split on EAWR.
Node 2 (EP > 0.15): mean 669.0000, std. dev. 138.0989, n=18 (45.00%).
Further splits on PNA at -0.95 (improvement = 2314.8043) and on PT at 0.90
(improvement = 655.7798).
Node 11: mean 503.0000, std. dev. 49.7654, n=11 (27.50%).
Node 12 (crossed out as pruned): n=2 (5.00%).
The remainder of the diagram is not legible in the source.]

Figure 17. Example of a pruned regression tree. This example tree was run with data
from Minneapolis using May teleconnections and June-August CDDs. The pruned
branches are crossed out if n<6 in any node.

As an example from Table 8: the total range of summed CDDs is broken into 14
ranges between 248 and 909, with the frequency distributions of occurrences for the CDD
ranges in the next column. The calculated percentage by which the range is reduced in
node 11 is based on the ratio of the new forecast range (626-390 = 236 CDDs) to the total
range (661 CDDs). Therefore, in node 11 the range is reduced by (1-(236/661)) or 64%.
Climatologically, over the 40-year period of record, the calculated CDDs are in node 11
12.5% of the time. The product of the reduced range (64%) and the observed
climatological frequency of occurrences per node (12.5%) shows an expected forecast
range reduction for node 11 of 8% over climatology. Summing all of the individual node
expected forecast range reduction percentages shows a total expected forecast range
reduction of 36.45% over climatology for Minneapolis. The expected forecast range
reductions for all 14 locations are shown in Table 9, with an overall expected forecast
range reduction of 35.7% over climatology. This value varies from 16.8% for Cincinnati
to 58.9% for Tucson.
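The arithmetic behind the expected forecast range reduction can be reproduced with a few lines of Python. The per-node reduction percentages and climatological frequencies are the Minneapolis values from Table 8; the function and variable names are hypothetical.

```python
def expected_range_reduction(reduction_pct, node_frequency):
    """Sum of (percent range reduction x climatological frequency) over nodes."""
    return sum(r * f for r, f in zip(reduction_pct, node_frequency))


# Minneapolis values from Table 8, for nodes 0, 1, 2, 4, 5, 8, 9, 11, 14
reduction_pct = [7, 14, 22, 29, 22, 43, 57, 64, 64]
node_frequency = [0.000, 0.075, 0.075, 0.100, 0.375, 0.050, 0.050, 0.125, 0.150]

total_reduction = expected_range_reduction(reduction_pct, node_frequency)

# Node 11 alone: new forecast range of 236 CDDs against the total 661 CDDs
node_11_reduction = 1.0 - 236.0 / 661.0
print(round(total_reduction, 2), round(node_11_reduction * 100))
```

The sum reproduces the 36.45% total for Minneapolis, and the node 11 calculation reproduces the 64% reduction quoted in the text.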
Table 8. Expected forecast range reduction. Prediction intervals computed in each node
are shown in gray (cells with value 1). The percentage the range is reduced is multiplied by the
climatological frequency distribution to obtain an expected forecast range reduction. The
individual node expected forecast range reduction percentages are summed to obtain a
total expected forecast range reduction over climatology, 36.45% in this case.

MIN CDD        Frequency      Node
               Distribution      0      1      2      4      5      8      9     11     14
>248<=295      0.0250            0      0      0      0      0      0      0      0      0
>295<=342      0.0000            1      1      0      1      0      0      0      0      0
>342<=390      0.0500            1      1      0      1      0      0      0      0      0
>390<=437      0.0250            1      1      1      1      1      1      1      1      0
>437<=484      0.1500            1      1      1      1      1      1      1      1      1
>484<=531      0.0750            1      1      1      1      1      1      1      1      1
>531<=579      0.2000            1      1      1      1      1      1      1      1      1
>579<=626      0.0750            1      1      1      1      1      1      1      1      1
>626<=673      0.1000            1      1      1      1      1      1      1      0      1
>673<=720      0.1000            1      1      1      1      1      1      0      0      0
>720<=767      0.0750            1      1      1      1      1      1      0      0      0
>767<=815      0.0250            1      1      1      0      1      0      0      0      0
>815<=862      0.0500            1      1      1      0      1      0      0      0      0
>862<=909      0.0250            1      0      1      0      1      0      0      0      0
>909           0.0250            0      0      1      0      1      0      0      0      0

                              Node
                                 0      1      2      4      5      8      9     11     14  Total
Percent reduction in forecast range per individual node (%)
                                 7     14     22     29     22     43     57     64     64
Climatological frequency distribution of occurrences per node (40 years)
                             0.000  0.075  0.075  0.100  0.375  0.050  0.050  0.125  0.150
Expected forecast range reduction (%)
                              0.00   1.05   1.65   2.90   8.25   2.15   2.85   8.00   9.60  36.45
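
The arithmetic behind Table 8 can be checked with a short script. The following is a minimal sketch, not the software used in the study. The variable names are illustrative, and the values are copied from the bottom rows of Table 8 and from the node 11 example in the text. The figures 236 and 661 are read here as the width of the node 11 prediction interval and the width of the climatological range, respectively.

```python
# Per-node values copied from Table 8 (Minneapolis, June-August CDDs).
nodes = [0, 1, 2, 4, 5, 8, 9, 11, 14]
percent_reduction = [7, 14, 22, 29, 22, 43, 57, 64, 64]  # range reduction per node (%)
climo_frequency = [0.000, 0.075, 0.075, 0.100, 0.375, 0.050, 0.050, 0.125, 0.150]

# Node 11 example from the text: a 236-CDD interval against a 661-CDD
# climatological range (interpretation of the two numbers is an assumption).
node11_reduction = 1 - 236 / 661

# Expected reduction per node = percent reduction x climatological frequency.
expected = [round(p * f, 2) for p, f in zip(percent_reduction, climo_frequency)]
total = round(sum(expected), 2)

for node, value in zip(nodes, expected):
    print(f"node {node:>2}: {value:5.2f}%")
print(f"node 11 range reduction: {node11_reduction:.0%}")
print(f"total expected forecast range reduction: {total:.2f}%")
```

Weighting each node by its climatological frequency means that a large range reduction in a rarely reached node contributes little to the total, which is why node 5 (22% reduction, 37.5% frequency) adds about as much as node 11 (64% reduction, 12.5% frequency).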
Table 9. The total expected forecast range reduction over climatology for the 14 forecast
locations. The overall average expected forecast range reduction for this example is
35.7% over climatology.

City                                      ATL    CHI    CIN    DFW     DM     LV    MEM
Expected forecast range reduction (%)    39.7   33.3   16.8   32.2   48.2   46.4   34.5

City                                      MIN    NYL    PHI    POR    SAC    TUC  WPAFB
Expected forecast range reduction (%)    36.5   39.4   36.1   24.9   30.2   58.9   22.7
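
The overall average quoted in the Table 9 caption can be reproduced from the per-city values. This is a small illustrative check, with the dictionary name chosen here for convenience; the numbers are copied from Table 9.

```python
# Expected forecast range reduction (%) per location, copied from Table 9.
reduction = {
    "ATL": 39.7, "CHI": 33.3, "CIN": 16.8, "DFW": 32.2, "DM": 48.2,
    "LV": 46.4, "MEM": 34.5, "MIN": 36.5, "NYL": 39.4, "PHI": 36.1,
    "POR": 24.9, "SAC": 30.2, "TUC": 58.9, "WPAFB": 22.7,
}

# Unweighted mean over the 14 locations, plus the extremes.
average = sum(reduction.values()) / len(reduction)
lowest = min(reduction, key=reduction.get)
highest = max(reduction, key=reduction.get)

print(f"average: {average:.1f}%")
print(f"lowest:  {lowest} ({reduction[lowest]}%)")
print(f"highest: {highest} ({reduction[highest]}%)")
```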
VI. Conclusions and Recommendations


Conclusions 

This  study  has  introduced  a  new  technique  to  significantly  increase  the  accuracy 
of  seasonal  long-range  temperature  forecasts.  It  statistically  explored  teleconnection 
indices  and,  using  a  tree-based  statistical  regression,  created  a  predictive  tool  for  future 
CDDs  and  HDDs  summed  over  three  months. 

Temperature data were gathered from 14 U.S. locations in order to represent most
of the climate regimes across the country (Objective 1). HDDs and CDDs were
calculated from the temperature data to use as predictand variables (Objective 2).
Teleconnection indices from the 13 most significant Northern Hemispheric
teleconnections and the Southern Oscillation Index in the Southern Hemisphere were
gathered to use as predictor variables (Objective 3). Ten years of data were then
withheld for independent verification of the technique (Objective 4).

Linear regression analysis was accomplished on the data using teleconnections
from May and summed CDDs from June-August. Valid models were found during the
analysis, but the amount of variance of the predictand explained by the linear regression
was rarely greater than 35%, too little to support a reliable predictive tool
(Objective 5).

Tree-based analysis was accomplished on the data (Objective 6), first using
classification tree analysis; however, extracting any predictive information also proved
difficult with this type of approach. Regression tree analysis was then accomplished on
the data. Trees were created and the predicted mean and standard deviations were used to
create a method for predicting seasonal CDDs and HDDs.

This new technique creates a range of HDDs/CDDs that is significantly narrower
than the range given by simple frequency distributions of occurrences. The predicted mean and
standard deviations from the regression tree output were used to calculate 95% prediction
intervals for each of the nodes. Teleconnection index values were run through the trees to find the
predicted node, and the interval for that node was then used as the
predictive range for the HDDs/CDDs for the particular forecast months (Objective 7).

This new model verified, using 10 years of independent data withheld from the
original data set (Objective 8), at an excellent 88% overall verification rate, with 3 of the
12 cities verifying at 100%. Two other cities, which verified at 90%, failed only
in the randomly selected year of 1988. This year is a well-known El Niño
year and record temperatures were experienced in some parts of the U.S. The summed
CDDs for Minneapolis and WPAFB in 1988 fell outside the range of the original data set.
Extrapolation of the model to fit data outside the range of the original data set was not
attempted because the fitted relationship may not be valid for such
outlying values. Had the numbers for 1988 been in the original data set, the results may
have been even better, with possibly two more cities verifying at 100%.

An  expected  range  reduction  percentage  over  climatology  was  created  from  the 
calculated  ranges.  An  average  expected  forecast  range  reduction  percentage  of  35.7% 
was  found  in  this  study. 

The question of spatial homogeneity arose during this study, but a full treatment was
beyond its scope. However, because WPAFB
was included in the study, two cities, Cincinnati and WPAFB, were close enough
to each other to investigate the spatial homogeneity of the created
prediction trees. Cincinnati verification data were run through the WPAFB trees and the
results were comparable to the WPAFB results. Additionally, WPAFB verification data
were run through the Cincinnati trees and the results were comparable to the Cincinnati
results. These results show spatial connections between the computed trees (Objective
9).

Overall, this study attempted to improve upon the methods currently used to
produce long-range forecasts of temperature over the U.S. Excellent results were
achieved, and predictive tree tools were created which are deemed ready for operational use
in long-range temperature forecasting. It is the conclusion of this study that this
innovative method works. It is also concluded that this method may be used to predict
multiple atmospheric variables, well in advance, for most locations within the Northern
Hemisphere.

Recommendations 

This study created a new technique for analyzing atmospheric
parameters. The hope is that this study will be a stepping-stone to future research that fully
explores the potential of this type of analysis. Research on this study
should continue along the following lines:

1. Try to understand how the regression tree analysis relates to physical atmospheric
synoptic circulation patterns, including why the regression tree
analysis splits where it does and why it uses the teleconnections in the order that it
does.

2. Try to extract the model from the software in order to fully automate the
technique. The teleconnections are currently run through the trees manually to
find the terminating node. The new prediction intervals are created after the
data calculated in the tree nodes are manually entered into a statistical
spreadsheet. The complete process needs to be automated.
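
As an illustration of what automating the manual tree traversal in recommendation 2 might look like, the sketch below walks a small tree to its terminating node. It is not extracted from the software used in the study. The `Node` class, the `terminal_node` function, and the input value for the index are hypothetical; the node statistics and split thresholds are those of the Cincinnati tree fragment (Node 6 and below) in the appendix, and the assumption that the `<=` branch leads to the first-listed child is labeled in the code.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    node_id: int
    mean: float                        # predicted summed CDDs for the node
    std_dev: float
    n: int                             # number of training years in the node
    index: Optional[str] = None        # teleconnection index used for the split
    threshold: Optional[float] = None  # split value for that index
    low: Optional["Node"] = None       # assumed branch when value <= threshold
    high: Optional["Node"] = None      # assumed branch when value > threshold


def terminal_node(node: Node, indices: Dict[str, float]) -> Node:
    """Follow the splits until a node without a split (a leaf) is reached."""
    while node.index is not None:
        node = node.low if indices[node.index] <= node.threshold else node.high
    return node


# Cincinnati subtree below Node 6, with values as listed in the appendix.
node6 = Node(
    6, 842.1, 113.8239, 10, index="SCA", threshold=-0.4,
    low=Node(11, 940.75, 60.7419, 4),
    high=Node(
        12, 776.3333, 90.1724, 6, index="SCA", threshold=0.1,
        low=Node(15, 822.5, 70.8919, 4),
        high=Node(16, 684.0, 1.4142, 2),
    ),
)

may_indices = {"SCA": 0.0}  # hypothetical May index value, for illustration only
leaf = terminal_node(node6, may_indices)
print(f"terminating node: {leaf.node_id}, predicted CDDs: {leaf.mean}")
```

Storing each tree as data rather than as hand-followed diagrams would also let the node means and standard deviations feed directly into the prediction interval calculation, removing the spreadsheet step.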

The  research  opportunities  using  this  process  are  limitless.  Currently  DoD  is 
looking  for  long-range  seasonal  forecasts  for  parameters  over  Afghanistan.  This  method 
could  be  used  anywhere  in  the  Northern  Hemisphere.  This  method  could  also  be  used  to 
predict  any  parameter,  to  include  the  vital  ones  necessary  in  a  wartime  scenario,  such  as 
cloud  cover,  precipitation,  and  visibility.  This  information  could  revolutionize  long- 
range  prediction  efforts  to  help  with  humanitarian  aid  operations  for  the  timing  and 
movement  of  supplies. 
Appendix: Regression Trees


The appendix contains the regression trees used in this study, which can be used as a
predictive tool. They were created using May teleconnection indices and summed CDDs
for June-August. Using the predicted mean and standard deviation, prediction intervals
are made for each valid node (n>5). Overall, this prediction interval is 90% likely to
contain the observed CDDs for the upcoming June-August, with a 35.7% overall
decrease in expected range over climatology.
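
The exact interval formula is not restated in this appendix, so the following is only a sketch under a stated assumption: it uses the textbook prediction interval for a single new observation from a normal sample, mean ± t · s · sqrt(1 + 1/n). The function name, the constant names, and the choice of formula are assumptions, not the study's implementation. The node values are those of Cincinnati Node 6 and Node 11, and the critical values are standard two-sided 95% Student t values.

```python
import math

MIN_NODE_SIZE = 5  # a node is treated as valid only when n > 5


def prediction_interval(mean, std_dev, n, critical_value):
    """Return (low, high) for a valid node, or None when n is too small.

    Assumed formula: mean +/- critical_value * std_dev * sqrt(1 + 1/n).
    """
    if n <= MIN_NODE_SIZE:
        return None
    half_width = critical_value * std_dev * math.sqrt(1 + 1 / n)
    return mean - half_width, mean + half_width


T_975_DF9 = 2.262  # two-sided 95% t critical value for n - 1 = 9 degrees of freedom
T_975_DF3 = 3.182  # two-sided 95% t critical value for n - 1 = 3 degrees of freedom

# Cincinnati Node 6: mean 842.1, standard deviation 113.8239, n = 10.
low, high = prediction_interval(842.1, 113.8239, 10, T_975_DF9)
print(f"Node 6 interval: {low:.0f} to {high:.0f} CDDs")

# Cincinnati Node 11 has n = 4, so it is not a valid node and no interval is returned.
print(prediction_interval(940.75, 60.7419, 4, T_975_DF3))
```

Passing the critical value in as an argument keeps the sketch free of third-party dependencies; a statistics library could supply it for any node size.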
Figure 18 (continued). Atlanta regression tree.
Figure 19. Chicago regression tree.
Figure 20. Cincinnati regression tree.


Node 6:  Mean 842.1000, Std. Dev. 113.8239, n 10, % 25.00, Predicted 842.1000
         Split on SCA (Improvement = 1621.9704): <= -0.4: Node 11; > -0.4: Node 12
Node 11: Mean 940.7500, Std. Dev. 60.7419, n 4, % 10.00, Predicted 940.7500
Node 12: Mean 776.3333, Std. Dev. 90.1724, n 6, % 15.00, Predicted 776.3333
         Split on SCA (Improvement = 639.4083): <= 0.1: Node 15; > 0.1: Node 16
Node 15: Mean 822.5000, Std. Dev. 70.8919, n 4, % 10.00, Predicted 822.5000
Node 16: Mean 684.0000, Std. Dev. 1.4142, n 2, % 5.00, Predicted 684.0000

Figure  20  (continued).  Cincinnati  regression  tree. 
Figure 20 (continued). Cincinnati regression tree.
Figure 21. Dallas-Fort Worth regression tree.
Figure 22. Des Moines regression tree.
Figure 22 (continued). Des Moines regression tree.
Figure 22 (continued). Des Moines regression tree.
Figure 23. Las Vegas regression tree.
Figure 23 (continued). Las Vegas regression tree.
Figure 24. Memphis regression tree.
Figure 25. Minneapolis regression tree.
Figure 25 (continued). Minneapolis regression tree.
Figure 26. New York, LaGuardia regression tree.
Figure 27. Philadelphia regression tree.
Figure 27 (continued). Philadelphia regression tree.
Figure 28. Portland regression tree.
Figure 29. Sacramento regression tree.
Figure 29 (continued). Sacramento regression tree.
Figure 30. Tucson regression tree.
Figure 30 (continued). Tucson regression tree.
Figure 31. WPAFB regression tree.
Figure 31 (continued). WPAFB regression tree.